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Abstract 



o 

(-^ This paper describes a novel approach to changepoint detection when the observed high-dimensional data may 

r-j have missing elements. The performance of classical methods for changepoint detection typically scales poorly 

with the dimensionality of the data, so that a large number of observations are collected after the true changepoint 
before it can be reliably detected. Furthermore, missing components in the observed data handicap conventional 
approaches. The proposed method addresses these challenges by modeling the dynamic distribution underlying 
■ the data as lying close to a time-varying low-dimensional submanifold embedded within the ambient observation 

space. Specifically, streaming data is used to track a submanifold approximation, measure deviations from this 
approximation, and calculate a series of statistics of the deviations for determining when the underlying manifold 
has changed in a sharp or unexpected manner. The approach described in this paper leverages several recent results 
in the field of high-dimensional data analysis, including subspace tracking with missing data, multiscale analysis 
techniques for point clouds, online optimization, and changepoint detection performance analysis. Simulations 
and experiments highlight the robustness and efficacy of the proposed approach in detecting an abrupt change in 
, an otherwise slowly- varying low-dimensional manifold. 

^ ! 1 Introduction 

\q _ Changepoint detection is a form of anomaly detection where the anomalies of interest are abrupt temporal changes 
in a stochastic process [1,2]. A good changepoint detection algorithm will accept a sequence of random variables 
| whose distribution may change abruptly at one time, detect such a change as soon as possible, and also have long 
qq ■ period between false detections. In many modern applications, the stochastic process is non-stationary away from 
the changepoints and very high dimensional, resulting in significant statistical and computational challenges. For 
s ! ■ instance, we may wish to quickly identify changes in network traffic patterns [3], social network interactions [4], 
surveillance video [5], or solar flare imagery [6,7]. 

Traditional changepoint detection methods typically deal with a sequence of low-dimensional, often scalar, random 
variables. Naively applying these approaches to high-dimensional data is impractical because the underlying high- 
dimensional distribution cannot be accurately estimated and used for developing test statistics. This results in 
detection delays and false alarm rates that scale poorly with the dimensionality of the problem. Thus the primary 
challenge here is to develop a rigorous method for extracting meaningful low-dimensional statistics from the high- 
dimensional data stream without making restrictive modeling assumptions. 

Our method addresses these challenges by using multiscale online manifold learning to extract univariate change- 
point detection test statistics from high-dimensional data. We model the dynamic distribution underlying the data 
as lying close to a time-varying, low-dimensional submanifold embedded within the ambient observation space. This 
submanifold model, while non-paramctric, allows us to generate meaningful test statistics for robust and reliable 
changepoint detection, and the multiscale structure allows for fast, memory-efficient computations. Furthermore, 
these statistics can be calculated even when elements are missing from the observation vector. 

While manifold learning has received significant attention in the machine learning literature [8-14], online learning 
of a dynamic manifold remains a significant challenge, both algorithmically and statistically. Most existing methods 
are "batch" , in that they are designed to process a collection of independent observations all lying near the same 
static submanifold, and all data is available for processing simultaneously. 

In contrast, our interest lies with "online" algorithms, which accept streaming data and sequentially update 
an estimate of the underlying dynamic submanifold structure, and changepoint detection methods which identify 
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significant changes in the submanifold structure rapidly and reliably. Recent progress for a very special case of 
submanifolds appears in the context of subspace tracking. For example, the Grassmannian Rank-One Update 
Subspace Estimation (GROUSE) [15] and Parallel Estimation and Tracking by REcursive Least Squares (PETRELS) 
[16] [17] effectively track a single subspace using incomplete data vectors. The subspace model used in these methods, 
however, provides a poor fit to data sampled from a manifold with non-negligible curvature. 

1.1 Related work 

At its core, our method basically tracks a time-varying probability distribution underlying the observed data, and 
uses this distribution to generate statistics for effective changepoint detection. For sequential density estimation 
problems such as this, it is natural to consider an online kernel density estimation (KDE) method. A naive variant 
of online KDEs would be quite challenging in our setting, however, because if we model the density using a kernel 
at each observed data point, then the amount of memory and computation required increases linearly with time 
and is poorly suited to large-scale streaming data problems. Ad-hoc "compression" or "kernel herding" methods for 
online kernel density estimation address this challenge [18] but face computational hurdles. Furthermore, choosing 
the kernel bandwidth, and particularly allowing it to vary spatially and temporally, is a significant challenge. Recent 
works consider variable bandwidth selection using expert strategies which increase memory requirements [19,20]. 
Some of these issues are addressed by the RODEO method [21], but the sparse additive model assumed in that 
work limits the applicability of the approach; our proposed method is applicable to much broader classes of high- 
dimensional densities. Finally, in high-dimensional settings asymmetric kernels which are not necessarily symmetric 
or coordinate-aligned appear essential for approximating densities on low-dimensional manifolds, but learning time- 
varying, spatially-varying, and anisotropic kernels remains an open problem. In a sense, our approach can be 
considered a memory-efficient sparse online kernel density estimation method, where we only track a small number 
of kernels, and we allow the number of kernels, the center of each kernel, and the shape of each kernel to adapt to 
new data over time. 

Our approach also has close connections with Gaussian Mixture Models (GMMs) [22-24]. The basic idea here is 
to approximate a probability density with a mixture of Gaussian distributions, each with its own mean and covariance 
matrix. The number of mixture components is typically fixed, limiting the memory demands of the estimate, and 
online expectation- maximization algorithms can be used to track a time- varying density [25]. However, this approach 
faces several challenges in our setting. In particular, choosing the number of mixture components is challenging even 
in batch settings, and the issue is aggravated in online settings where the ideal number of mixture components may 
vary over time. Furthermore, without additional modeling assumptions, tracking the covariance matrices for each of 
the mixture components is very ill-posed in high-dimensional settings. 

Our approach is closely related to Geometric Multi- Resolution Analysis (GMRA) [14], which was developed for 
analyzing intrinsically low-dimensional point clouds in high-dimensional spaces. The basic idea of GMRA is to first 
iteratively partition a dataset to form a multiscalc collection of subsets of the data, then find a low-rank approximation 
for the data in each subset, and finally efficiently encode the difference between the low-rank approximations at 
different scales. This approach is a batch method without a straightforward extension to online settings. 

1.2 Motivating applications 

The proposed method is applicable in a wide variety of settings. Consider a video surveillance problem. Many 
modern sensors collect massive video streams which cannot be analyzed by human due to the sheer volume of data; 
for example, the ARGUS system developed by BAE Systems is reported to collect video-rate gigapixel imagery [26,27], 
and the Solar Dynamics Observatory (SDO) collects huge quantities of solar motion imagery "in multiple wavelengths 
to [help solar physicists] link changes in the surface to interior changes" [28]. Solar flares have a close connection 
with geomagnetic storms, which can potentially cause large-scale power-grid failure. In recent years the sun has 
entered a phase of intense activity, which makes monitoring of solar flare bursts an even more important task [7]. 
With these issues in mind, it is clear that somehow prioritizing the available data for detailed analysis is an essential 
step in the timely analysis of such data. If we can reliably detect statistically significant changes in the video, we 
can focus analysts' attention on salient aspects of the dynamic scene. For example, we may wish to detect a solar 
flare in a sequence of solar images in real time without an explicit model for flares, or detect anomalous behaviours 
in surveillance video [29]. Salicncy detection has been tackled previously [30,31], but most methods do not track 
gradual changes in the scene composition and do not detect temporal changepoints. 

A second motivating example is credit history monitoring, where we are interested in monitoring the spending 
pattern of a user and raising an alarm if a user's spending pattern is likely to result a default [32]. Here normal 
spending patterns may evolve over time, but we would expect a sharp change in the case of a stolen identity. 
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An additional potential application arises in computer network anomaly detection [33]. Malicious attacks or 
network failure can significantly affect the characteristics of a network [3,34]. Recent work has shown that network 
traffic data is well-characterized using submanifold structure [35], and using such models may lead to more rapid 
detection of changepoints with fewer false alarms. 

1.3 Contributions and paper organization 

The primary contributions of this work are two-fold: we present (a) a fast method for online tracking of a dynamic 
submanifold underlying very high-dimensional noisy data with missing elements and (b) a principled changepoint 
detection method based on easily computed residuals of our online submanifold approximation. These methods are 
supported by both theoretical analyses and numerical experiments on simulated and real data. 

The paper is organized as follows. In Section 2 wc formally define our setting and problem. Section 3 describes 
our multiscale submanifold model and tracking algorithm, which is used to generate the statistics used in the 
changepoint detection component described in Section 4. Several theoretical aspects of the performance of our 
method are described in Section 5, and the performance is illustrated in several numerical examples in Section 6. 

2 Problem Formulation 

Suppose we are given a sequence of data Xi,X2, ■ • ■ ,, and for t = 1,2,..., x t G M. D , where D denotes the ambient 
dimension. The data are noisy measurements of points lying on a submanifold v t G Ait- 

x t — v t + w t . (1) 

The intrinsic dimension of the submanifold Ait is d. We assume d <C D. The noise wt is a zero mean white Gaussian 
random vector with covariance matrix a 2 I. The underlying submanifold Ait may vary slowly with time. At each 
time t, we only observe a partial vector Xt at locations fit G {1, ... , D}. Let Vo, t represent the |O t | x D matrix that 
selects the axes of MP indexed by £l t ', we observe Vn t x t- 

Our goal is to design an online algorithm that generates a sequence of approximations Ait which tracks Ait when 
it varies slowly, and detects anomalies as soon as possible when the submanifold changes abruptly. The premise is 
that the statistical properties of the tracking error will be different when the submanifold varies slowly versus when 
it changes abruptly. 

Define the operator 

VMXt = arg min \\x - x t \\ 2 (2) 

as the projection of observation xt on to Ai. If we had access to all the data simultaneously without any memory 
constraints, we might solve the following batch optimization problem using all data up to time t for an approximation: 

t 

M° t = arg mm { ^ a* - *^ (a* - V M x l )\\ 2 +/J,pen(M)\, (3) 
i=i 

where ||a;|| denotes the Euclidean norm of a vector x, pen(.M) denotes a regularization term which penalizes the 
complexity of Ai, a G (0, 1] is a discounting factor on the approximation error at each time t, and /i is a user- 
determined constant that specifies the relative weights of the data fit and regularization terms. 

Note that (3) cannot be solved without retaining all previous data in memory, which is impractical for the 
applications of interest. To address this, we instead consider an approximation to the cost function in (3) of the form 
F(Ai) + /xpcn(A^). To develop an online algorithm, instead of solving (3), we find a sequence of approximations 
Aii, . . . ,Ait (without storing historic data), such that Ait+i is computed by updating the previous approximation 
Ait using F(Ai) and the current datum Xt+i- In Section 3, we will present several forms of F(Ai) that lead to 
recursive updates and efficient tracking algorithms. One example of a MOUSSE approximation is illustrated in 
Figure 1; the context is described in more detail in Section 6.2. 

Given the sequence of submanifold estimates Aii, ■ ■ . , Ait, we can compute the distance of each xt to Ait, which 
we denote {et}. We then apply changepoint detection methods to the sequence of tracking errors {et}. In particular, 
we assume that when there is no anomaly, the are i.i.d. with distribution vq. When there is an anomaly, there 
exists an unknown time n < t such that before the changepoint e\,...,e K are i.i.d. with distribution vq, and after 
the changepoint, e re+ i, . . . are i.i.d. with distribution v\. Our goal is to detect the anomaly as quickly as possible 
after it occurs, and make as few false alarms as possible. 
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Figure 1: Approximation of MOUSSE at t = 250 (upper) and t = 1150 (lower) of a 100-dimensional submanifold. 
In this figure we project everything into three-dimensional space. The blue curve corresponds to true submanifold, 
the dots are noisy samples from the submanifold (the lighter dots are more dated than the darker dots), and the red 
line segments are the approximation with MOUSSE. As the curvature of the submanifold increases, MOUSSE also 
adapts in the number of line segments. 

3 Multiscale Online Union of Subspace Estimation (MOUSSE) 

In the following we describe the Multiscale Online Union of SubSpaccs Estimation (MOUSSE) method, including 
the underlying multiscale model and online update approaches. 

3.1 Multiscale union of subspaces model 

MOUSSE uses a union of low-dimensional subsets Mt to approximate Mt, and organizes these subsets using a tree 
structure. The idea for a multiscale tree structure is drawn from the multiscale harmonic analysis literature [36]. The 
leaves of the tree are subsets that are used for the submanifold approximation. Each node in the tree represents a local 
approximation to the submanifold at one scale. The parent nodes are subspaces that contain coarser approximations 
to the submanifold than their children. The subset associated with a parent node roughly covers the subsets associated 
with its two children. 

More specifically, our approximation at each time t consists of a union of subspaces Sj t k,t that is organized using 
a tree structure. Here j € {1, . . . , J t } denotes the scale or level of the subset in the tree, where J t is the tree depth at 
time t, and k £ {1, . . . , 2 J } denotes the index of the subset for that level. The approximation Mt at time t is given 
by: 

M t = |J <W (4) 

U-k)eA t 

where At contains the indices of all leaf nodes used for approximation at time t. Also define 7t to be the set of 
indices of all nodes in the tree at time t, with 

At c T t . 

Each of these subsets lies on a low-dimensional hypcrplanc with dimension d and is parameterized as 

S j>k ,t = {veR D :v = Uj, k , t z + Cj, k , t , z T A~l t z < 1, z£ R d }. (5) 

where the notation T denotes transpose of a matrix or vector. The matrix Uj t k,t £ IR- *'' is the subspace basis, and 
Cj.k.t € is the offset of the hyperplane from the origin. The diagonal matrix 

A JiM ^diag{AW t ,...,AS t }e]R d x d , 

with X^j, t >...> Xjl t > 0, contains eigenvalues of the covariance matrix of the projected data onto each subspace. 
This parameter specifies the shape of the ellipsoid by capturing the spread of the data within the subset. In summary, 
the parameters for Sj t k,t ar e 

{Uj,k,t,Cj t k,t,&j,k,t}(j,k)eT t , 
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and these parameters will be updated online. 

In our tree structure, the leaf nodes of the tree also have two virtual children nodes that keep necessary information 
used when further partitioning is needed. The complexity of the approximation is defined to be the total number of 
subsets used for approximation at time t: 

Kt±\M (6) 



which is used as the complexity regularization term in (3) 

pcn(A? t ) 4 K t . 

The tree structure is illustrated in Figure 2. 
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Figure 2: Illustration of tree structure for subspaces. The subspaces used in our approximation are {iSi,o,t U £2,2,* U 

<->2,3,t}- 



3.2 MOUSSE Algorithm 

When a new sample xt+i becomes available, MOUSSE updates Ait to obtain Ait+i- The update steps are presented 
in Algorithm 1; there are three main steps, detailed in the below subsections: (a) find the subset in the Ait which 
is closest to Xt+i, (b) update a tracking estimate of that closest subset and its ancestors, and (c) grow or prune the 
tree structure to preserve a balance between fit to data and complexity. We use [z] m to denote the m-th element of 
a vector z, and z T to denote the transpose of a matrix or vector z. 

3.3 Distances for MOUSSE 

To update the submanifold approximation, we first determine the affinity of Xt+i to each subset. We might use 
Euclidean distance, but this distance is problematic since in our approximation each subspace is local with boundary 
defined by an ellipsoid. Hence, a point can be close to a hyperplane but far away from the center of the ellipsoid. An 
alternative choice is the Mahalanobis distance, but this distance is only finite and well-defined for points lying in one 
of the low dimensional subspaces in our approximation. Since our construction is a pieccwise linear approximation 
to a submanifold which may have some curvature, we anticipate many observations which are near but not in our 
collection of subspaces, and we need a well-defined, finite distance measure for such points. 

To address these challenges, we introduce the approximate Mahalanobis distance of a point x to a subspace S. 
Assume x with support f2, parameter S, and the parameters for a set S is given by {U, c, A}. Define 

U n =VnU eR^ xd , c n ±V Q c£M.M, 

and 

xn = V n xeM} Q{ . 

Define the pseudoinverse operator that computes the coefficients of a vector in the subspace spanned by V as 

V* ^ (V T V)~ 1 V T . (8) 
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Algorithm 1 MOUSSE 



1: Input: 

error tolerance e, step size a, relative weight fj. 

2: Initialize tree structure, set eo = 

3: for t = 0, 1, . . . do 

4: Given new data Xt+i and its support f2t+i, find the minimum distance set Sj* t k*,t according to {j*,k*) = 

argmm 0iA . )eA ps J>kt (xt+i,S jtkt t) using (13) 

5: Calculate: e t +\ — using (15) 

6: Update all ancestor nodes and closest virtual child node of (j* , k*) using Algorithm 2 

7: Calculate: e t +i = ae t + (1 — a ) e t+i 

8: Denote parent node of (j* , k*) as (j* — 1, k p ) and virtual child node closest to Xt+i as (j* + 1, k v ) 

9: if e t+ i > e and d(a: t +i, <S,-» +i,fc„,t) + At(i^ + 1) < ef +1 + /j,K t then 

10: Split (j*,k*) using Algorithm 3 

11: end if 

12: if e t+1 < e and d(x t+ i, Sj*-i, kp ,t) + n(K t - 1) < ef +1 + /iif t then 

13: Merge (j*, k*) and its sibling using Algorithm 4 

14: end if 

15: Update At and 7t 

16: end for 



Algorithm 2 Update node 

1: Input: node index (j,k), a, S and subspacc parameters 

2: Calculate: /? and /3j_ using (9) and (10) 

3: Update: [c^ k ,t+\]m = ot[ c j,k,t\m + (1 - a)[x t +i] m , m G Ot+1 

4: Update: A^ +1 = oA^ t + (1 - a) 0ft, m = 1, . . . , d 

5: Update: S j>k ,t+i = a6 jik ,t + (1 - a)\\/3 ± \\ 2 / (D - d) 

6: Update basis Uj jkj t using (modified) subspace tracking algorithm 



Algorithm 3 Split node (j, k) 



1: Turn two virtual children nodes (j + 1, 2fc) and (j + 1, 2fc + 1) of node (j, k) into leaf nodes 
2: Initialize virtual nodes (j + 1, 2fc) and (j + 1,2k + 1): 



fci = 2fc 
k 2 =2k + l 



Cj+i,ki,t+i = Cj,k,t + \J X^l !t uf kt /2 

Cj+l,k 2 ,t+l = c j>k ,t - \J >^lt u 3\lt/ 2 
Uj + i t kt,t+i = Uj, k ,t, i = l,2 

A (m) - A (m) to = 2 d 



Algorithm 4 Merge (j, fc) and its sibling 
1: Make the parent node of (j, k) into a leaf node 

2: Make (j, fc) and its sibling into virtual children nodes of the newly created leaf 
3: Delete all four virtual children nodes of (j, k) and its sibling 



When U is an orthogonal matrix, we have U* = U T , but in general U$ ^ U^. Let 



= Ug(x a -cn), (9) 
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and 

P± = (I-U a U$)(xn-cn). (10) 

In this definition, (3 is the projection of x on U , and captures the energy of the projection residual. We denote 

Euclidean distance between x with support f2 and the subspace where S lies on as 

d(x,S) = \\x a - cq - U n U*(x n - c n )\\ 2 = \\(3 ± \\ 2 . (11) 

Next we introduce the approximate Mahalanobis distance, which is a hybrid of Euclidean distance and Mahalanobis 
distance. Mahalanobis distance is commonly used for data classification, which measures the quadratic distance of 
x to a set S of data with mean c = E{x} and covariance £ = E{(.t — c)(x — c) T }. Specifically, the Mahalanobis 
distance is defined as 

g(x,S) = (x — c) — c). 

Assuming the covariance matrix has a low-rank structure with d large eigenvalues and D — d small eigenvalues, we 
can write the eigendecomposition of the covariance matrix S as 

E=[U U±]A[U Ua_] T = UA 1 U T + U^A 2 Uj, 

where A = diag{Ai, . . . , A^}, Ai > . . . > A,d, Ai = diag{Ai, . . . , A^}, A2 = diag{Ad+i, . . . , Ad}. If we further assume 
that the D — d small eigenvalues are all approximately equal to S, i.e., A2 ps 51, then 

q(x,S) ps (x - c) T UA^U T (x - c) + d^WUlix - c)\\ 2 . (12) 

From here we introduce the approximate Mahalanobis distance when the data is low-dimensional: 

Ps (x,S)^p T A- 1 f3 + S- 1 \\p ± \\ 2 . (13) 

Note that when the data is complete, p$(x,S) is equal to the right-hand-side of (12). With missing data, p$(x,S) is 
an approximation to g(x,S). 



3.4 Update subset parameters 

When updating subspaces, we can update all subspaces in our multiscale representation and make the update step-size 
to be inversely proportional to the approximate Mahalanobis distance between the new sample and each subspace, 
which we refer to as the "update-all" approach. Alternatively, we can just update the subspace closest to xt+i, its 
virtual children, and all its ancestor nodes, which we refer to as the "update-nearest" approach. The update-all 
approach is computationally more expensive, especially for high dimensional problems, so we focus our attention on 
the greedy update-nearest approach. The below approaches extend readily to the update-all setting, however. 

With the approximate Mahalanobis distance defined above, we can find the subset with minimum distance to the 
new datum 

(j*,k*) = argmin#5 (x t+1 ,S jtktt ), 

and then update the parameters of that subset, all its ancestors in the tree, and its two virtual children. The update 
algorithm is summarized in Algorithm 2 which denotes the parameters associated with Sj*.k*,t &s (c,U, A,d), and 
drops the j* , k* , and t indices for simplicity of presentation. The update of the center c, A and 6 are straightforward. 
Using the definition in (2), we have 

V Ut x = Vl U n U* (x n ~ c n ) + c; (14) 

that is, the projection onto the submanifold approximation is the projection onto the nearest subset. We further 
define the instantaneous approximation error of the submanifold at time t as: 

et = \\Vn t (xt-V^x t )\\, (15) 

and note that this is equivalent to the norm of the orthogonal projection of x t onto <Sj» k» ti denoted ft j_; i.e., 

e t = ||ft,J|- (16) 
Next we will focus on three approaches for updating U. 
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3.4.1 GROUSE 

To use GROUSE subspace tracking in this context, we approximate the first term in (3) as 



F{M) = ^ t+1 - l \\VnM ~ ^M^i)\\ 2 + II^Vi (^+1 ~ PMXt+JW 2 . (17) 

4=1 

Note the first term is a constant with respect to M., so we need only to consider the second term in computing an 
update. The basic idea is now to take a step in the direction of the instantaneous gradient of this cost function. 
Since M. is constrained to be a union of subsets and the projection operator maps to the closest subset, this task 
corresponds to the basis update of GROUSE [15] with the cost function 

/([/) 4 min ||P Qt+1 (x t+1 ~Ua~ c)|| 2 (18) 

a 

(assuming U is orthonormal and including the offset vector c). Following the same derivation as in [15], we have that 

= -2Vu t+1 (x t+ i - c - U(5)I3 T 4 -2r(3 T , (19) 

where j3 is defined in (9), and 

r = Vtit+i ( x t+i —c-Uff). 
The gradient on the Grassmannian is given by 

V/ = (I - UU T )jL = -2(1 - UU T )rf3 T = -2r(3 T , 

since U T r = 0. We obtain that the update of Ut using the Grassmannian gradient is given by 

Ut+1 = Ut + T P#T + shy Mff 

where rj > is the step-size, and £ = ||r||||J7 t /3||. The step-size r\ is chosen to be -q = ?7o/||a; t+ i||, for a constant rjo > 0. 



3.4.2 PETRELS 

Let a;(i), X(2), X(3), . . . denote a subsequence of the data such that each xu\ was drawn from the (J*, k*) node in our 
multiscale approximation, and let n t denote the length of this subsequence. Then we can approximate F(M) as 

Tit 

F(M) - «"'" l m in ll^n w (x W - c - Uz)\\\ (20) 
t=i 

where, as before, U and c correspond to the subspace parameters for subset Sj*^*,t- The minimization of F(M.) in 
(20) can be accomplished using the PETRELS algorithm [37], yielding a solution which can be expressed recursively 
as follows. Denoting by [U] m the m-th column of U, we have the update of U given by 

[U t +i] m = [U t ] m + Irnen t ([Uta t +i\ m - o t+1 [U t } m )(R m ,t+i)*a t+ i, (21) 

for m = 1, . . . ,D, where Ia is the indicator function for event A, (■)# denotes pseudo-inverse of a matrix, and 

a t+1 = {UjVn t+1 U t )*Ujx t+1 . 

The second-order information in R mtt +i can be computed recursively as 

(Rm.t+i)* = a-\R m . t )* + ^T P 7i + \^ (R m ,t)*a t aJ(R m , t )*. (22) 

1 + a L al +1 (R mt t)*a t+ i 

Note that PETRELS does not guarantee the orthogonality of Ut+i, which is important for quickly comput- 
ing projections onto our submanifold approximation. To obtain orthonormal Ut+\, we may apply Gram-Schmidt 
orthonormalization after each update. We refer to this modification of PETRELS as PETRELS-GS. This orthogo- 
nalization requires an extra computational cost on the order of 0(Dd 2 ) and may compromise the continuity of Ut, 
i.e., the Frobenius norm ||C/t+i — Ut\\F after the orthogonalization may not be small even when the corresponding 
subspaces are very close [38]. As a result, this orthogonalization may change the optimality of Ut- A faster orthonor- 
malization (FO) strategy with less computation which also preserves the continuity of Ut is given in [38]. We refer 
to this FO strategy combined with PETRELS as PETRELS-FO. 
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3.4.3 Computational complexity 



For each update with complete data (the maximum computational complexity), the computational complexity of 
GROUSE is on the order of O(Dd), PETRELS-GS is 0(Dd 2 ), and PETRELS-FO is O(Dd). More details about the 
relative performance of these three subspace update methods can be found in Section 6. 

3.5 Tree structure update 

When the curvature of the submanifold changes and cannot be sufficiently characterized by the current subset 
approximations, we must perform adaptive model selection. This can be accomplished within our framework by 
updating the tree structure - growing the tree or pruning the tree, which we refer to as "splitting" and "merging" 
branches, respectively. Previous work has derived finite sample bounds and convergence rates of adaptive model 
selection in nonparametric time series prediction [39]. 

To decide whether to change the tree structure, we introduce the average approximation error. 



This error is an approximation to the first term in (3), where we replace Vj^ with the projection onto a sequence 
of approximations Prj • We will consider changing the tree structure when e t is greater than our prescribed error 
tolerance e > 0. 

Splitting tree branches increases the resolution of the approximation at the cost of higher estimator complexity. 
Merging reduces resolution but lowers complexity. When making decisions on splitting or merging, we take into 
consideration the approximation errors as well as the model complexity (the number of subspaccs K t used in the 
approximation). This is related to complexity-regularized tree estimation methods [36,40, 41] and the notion of 
minimum description length (MDL) in compression theory [42,43]. In particular, we use the sum of the average 
fitting error and a penalty proportional to the number of subspaces used for approximation as the cost function when 
deciding to split or merge. The splitting and merging are summarized in Algorithm 3 and Algorithm 4. The splitting 
process mimics the fc- means algorithm. In these algorithms, note that for node (j, k) the parent is node — \_k/2\) 
and the sibling node is (j, k + 1) for k even or (j, k — 1) for k odd. 

3.6 Initialization 

To initialize MOUSSE, we assume a small initial training set of samples, and perform a nested bi-partition of the 
training data set to form a tree structure, as shown in Figure 2. The root of the tree represents the entire data set, 
and the children of each node represent a bipartition of the data in the parent node. The bipartition of the data 
can be performed by the fc-means algorithm. We start with the entire data, estimate the sample covariance matrix, 
perform an eigendecomposition, extract the d-largest eigenvectors and eigenvalues and use them for E/i.i,o and Al i o, 
respectively. The average of the (D — d) minor eigenvalues are used for o- If t ne approximation error is greater 
than the prescribed error tolerance e, we further partition the data into two clusters using fc-means (for k = 2) and 
repeat the above process. We keep partitioning the data until Sj t k,o is less than e for all leaf nodes. Then we further 
partition the data one level down and form the virtual nodes. This tree construction is similar to that used in [14]. 

In principle, it is possible to bypass this training phase and just initialize the tree with a single root node and 
two random virtual children nodes. However, the training phase makes it much easier to select algorithm parameters 
such as e and provides more meaningful initial virtual nodes, thereby shortening the "burn in" time of the algorithm. 

3.7 Choice of parameters 

In general, a should be close to 1, as in the Recursive Least Squares (RLS) algorithm [44]. In the case when the 
submanifold changes quickly, we would expect smaller weights for approximation based on historical data and thus 
a smaller a. In contrast, a slowly evolving submanifold requires a larger a. In our experiments, a ranges from 0.8 
to 0.95. e controls the data fit error, which varies from problem to problem according to the smoothness of the 
submanifold underlying the data and the noise variance. Since the tree's complexity is controlled and pcn(A^) in (3) 
is roughly on the order of 0(1), we usually set \i close to e. 




(23) 



i=i 
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4 Changepoint detection 



We are interested in detecting changes to the submanifold that arise abruptly and change the statistics of the data. 
When the submanifold varies slowly in time, MOUSSE described in Section 3 can track the submanifold and produce 
a sequence of stationary tracking errors. When an abrupt change occurs, MOUSSE loses track of the manifold and 
results in an abrupt increase in tracking errors. Hence, using tracking errors, we can develop a changepoint detection 
algorithm to detect abrupt changes in the submanifold. 



4.1 CUSUM procedure 

We adopt the widely used statistical CUSUM procedure [2, 45] for changepoint detection. In particular, we assume 
that vq is a normal distribution with mean /zo and variance <7q, and V\ is a normal distribution with mean \x\ and 
the same variance ctq. Then we can formulate the changepoint detection problem as the following hypothesis test: 



H : ei,...,e t ~JV(/j o ,0o) 

Hi: ei,...^ ~JV(/tio,<7o), e K+ i, . . . , e t ~ J\f(fii, Ctq) 



.r, 2, ( 24 ) 



We assume (1q and Ctq are known since typically there is enough normal data to estimate these parameters. (When 
the training phase is too short for this to be the case, these quantities can be estimated online, as described in [46].) 
However, we assume [i\ is unknown since the magnitude of the changepoint can vary from one instance to another. 
In forming the detection statistic, we replace fii by its maximum likelihood estimate (for each fixed changepoint time 
K = k): 

St — Sk 

Mi = 



t 



where 



t 

i=l 



This leads to the generalized CUSUM procedure, which computes the CUSUM statistic at each time t and stops the 
first time when the statistic hits threshold b: 

T = M\t>l: max l{St - Sk) ^1 fc)l > b], (25) 
I t-w<k<t ao y/t - k J 

where w is a time-window length such that we only consider the most recent w errors for changepoint detection, 
and the threshold b is chosen to control the false- alarm-rate, which is characterized using average-run-length (ARL) 
in the changepoint detection literature [47]. Typically we would choose w to be several times (for example, 5 to 10 
times) of the expected detection, then the window length will almost have no effect on the detection delay [48] . This 
threshold choice is detailed in Section 5.5. 



4.2 Distribution of e t 

In deriving the CUSUM statistics we have assumed that et are i.i.d. Gaussian distributed. A fair question to ask is 
whether et is truly Gaussian distributed, or even to ask whether et is a good statistic to use. Consider the complete 
data case. 

h=V Ut (x t )~x t . (26) 

In general et has non-zero mean, since we approximate the manifold using a union of subspaces. Moreover, due to 
curvature of the manifold, in general each dimension of the error vector, [et] m , are not independent and the variances 
of [e t }m may not be identical. In fact, no closed form expression for the distribution of e t = \\e t \\ exists. 

However, a Gaussian distribution is a good approximation for the distribution of et, since when D is large, et 
averages over errors across D dimensions. The QQ-plot of et from one of our numerical examples in Section 6 when 
D = 100 is shown in Figure 3. We will also demonstrate in Section 6.4 that the theoretical approximation for ARL 
using Gaussian assumption for e* is quite accurate. 
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Figure 3: Q-Q plot of et, for a D — 100 submanifold. 



5 Performance Analysis 

In this section, we first study the performance of MOUSSE, and then study the choice for the threshold parameter 
of the changepoint detection algorithm and provide theoretical approximations. A complete proof of convergence 
of MOUSSE (or GROUSE or PETRELS) is hard since the space of submanifold approximations we consider is 
non-convex. Nevertheless, we can still characterize several aspects of our approach. 

In Sections 5.1 and 5.2 below, we assume that there is complete data, and we restrict our approximation to a single 
subspace so that K t — 1. Assume the mean and covariance matrix of the data are given by c* and £*, respectively 
Assume the covariance matrix has low-rank structure: £* = diag{A*, . . . , A^,} with A m = 8* for m = d+ 1, . . . , D. 



5.1 Optimality of estimator for c* 

When there is only one subspace and the data are complete, the cost function (3) without the penalty term becomes 



t 

min^^-Kl-^^-c)!! 2 . (27) 

U.c L — * 



By writing each term as ||(7 — UU T )(x L — c)|| 2 = \\(I — UU T )(xi — x + x — c)|| 2 , we can write the cost function as 

tx[(I - UU T )S{I - UU T )} + ^—^-{x - c) T {I - UU T )(x - c), (28) 

1 — a 

where 

t t 

x = ^2a t - z x i , S = ^2a t - l {x l -x)(x l -x) T . (29) 

i=l i=l 

Since the second term in (28) is quadratic in c, it is minimized by c = x. Also note that in this case the optimization 
problem (3) decouples in U and c. Since our online update for ct is given by ct+i = act + (1 — u)xt, we have 



Ci 



= (1 — a) a* l xi + a* c , 



where the term a'co is a bias introduced by initial condition cq. Since E{xt} = c*, E{q} — > (1 — a) ■ jzr^ c * = 
Hence our estimator for c is proportional to the minimizer for (3) and is asymptotically unbiased. 



5.2 Consistency of estimators for A* and 5* 

In the following, we show that when there is complete data, if we have correct U — U*, then for each sample Xt, its 
projection [/3t] m is an unbiased estimator for A^, and ||/3i,_i_|| 2 is an unbiased estimator for J2m=d+i Kn- First note 

E{p t ] m | 2 } = E{[elU T (x t - c)] 2 } = elU T ^Ue m 

= ^m[U] m [U] m = A m , 
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(31) 



for m = 1, . . . , d, where e m denotes the m-th row of an identity matrix. We also have that 

E{||A, ± || 2 } =E{||(/-[/[/ T )(:r t - C )|| 2 } 

=tr{(J - UU T )Z*(I - UU T )} 

D 

= Kn ■ 

m—d+l 

Then from the MOUSSE update equations, as t — > oo 

E{A< m) } = E{(1 - a) ELi "'"WJ 2 + a*A< m) } A^, (32) 

for m = 1 , . . . , d and 

E{6 t } = E{(1 - a) E-=i "'"'ll^xf/P - d) + <**<*„} ^ E™= d+ i ^ = 5*. (33) 
Hence our estimators for A^, and 5* are asymptotically unbiased. Moreover, from (16), we have that 

E{e 2 }=E{||A, ± || 2 }= J2 Kn- (34) 

m—d+l 

Hence, the average approximation error of MOUSSE also converges 

t D 

E{ £t } = (l-a)^« t -E{e 2 }^ ]T A^. 

i=l m=d+l 

5.3 MOUSSE approximation errors 

As mentioned earlier, our multiscale subset model is closely related to geometric multiresolution analysis (GMRA) 
[14]. In that work, the authors characterize the favorable approximation capabilities of the proposed multiscale 
model. In particular, they prove that the magnitudes of the geometric wavelet coefficients associated with their 
algorithm decay asymptotically as a function of scale, so a collection of data lying on a smooth submanifold can 
be well-approximated with a small number (depending on the submanifold curvature) of relatively large geometric 
wavelets. These geometric wavelets are akin to the leaf nodes in our approximation, so the approximation results 
of [14] suggest that our model admits accurate approximations of data on smooth submanifolds with a small number 
of leafs. 

5.4 Missing data 

In Section 5.2 we have shown that with full data, our estimators are consistent. Note these estimators depend on (3 
and /?_[_. In the following we show that j3 and when using a missing data projection, are close to their counterparts 
when using a complete data projection. Hence, when the fraction of missing data is not large, the performance of 
MOUSSE with missing data is also consistent. In this section, we omit the subscripts j, k and t, and denote H t by 
f2 to simplify notation. Define the coherence of the basis U as [49] 

coh(U) = ^ma X \\UV u e m \\l (35) 
a m 

Theorem 1. Let e > 0. Given x = v + w, and w is a white Gaussian noise with zero mean and covariance matrix 
°~ 2 Idxd- Let f3 = U T (x — c), and /3q = Uq(xq — cq). If for some constant £ £ (0, 1), 

D 



101 > max ^ -coh(U)dlog(2d/e), - — 

' 3 y J 81 1 ' ' 3 (1 -£)log(2D/e) 



then with probability at least 1 — 3e 

'111 2 < 9 

'{l-lf |0| ""' V ^" y " ' " (1 — £) 2 |0| 2 ' 

where 



II A flll2^o (l + g) 2 d urn\\ 112 , 2 (64/9)^ 2 

Wn~l3\\ 2 <2- — • — • coh{U)\\q\\ + a (36) 



/ 2 ^max|^ log(i/£)) 



mi 



dq = (I-UU T )(v - c). 
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This theorem shows that the fraction of non-zero entries, |fi|, should be on the order of max{dlogc?} or {D/ log(D)} 
for accurate estimation of /3q. We have a similar bound for the difference between f3± and /3n,_i_, based on similar 
techniques used to derive Theorem 1. The first term in the error bound (36) is proportional to \\q\\ , the approximation 
error related to the distance of v from U, and the second term in (36) is due to noise. 

Proof of this theorem can be found in Appendix A. This theorem can be viewed as an extension of [49, Theorem 
1] to include noise, and also to directly bound ||/3 — (3q\\ instead of bounding \\vq — Uaf3a\\ using ||u — U/3\\. 



5.5 Choice of threshold for changepoint detection 

In accordance with standard changepoint detection notation, denote by E°° the expectation when there is no change, 
i.e., Eh , and by E fc the expectation when there is a changepoint at K = k, i.e., E|H lire =fc. The performance metric for 
a changepoint detection algorithm is typically characterized by the expected detection delay sup fc>0 E fe {T — k\T > k} 
and the average-run-length (ARL) E°°{T} [47]. Typically we use E°{T} as a performance metric since it is an upper 
bound for sup fc>0 E k {T - k\T > k}. Note that the CUSUM statistic (25) is equivalent to 

T = m£{t > 1 : max l St ~ 5fc l > h } (37) 

t- w <k<t y/t^k 

where St = X)i=i( e * — A*o)/o"o- Under Ho, we have (e, — no)/o~o i.i-d. Gaussian distributed with zero mean and unit 
variance. Using the results in [50], we have the following approximation. When b — > oo, 

o J xv*{x)dx 

where v{x) = (i/2^$(^/^+V(° )% < t ) ^ x ) ana - &( x ) are the pdf and cdf of the normal random variable with zero 
mean and unit variance. We will demonstrate in Section 6.4 that this asymptotic approximation is fairly accurate 
even for finite 6, which allows us to choose the changepoint detection threshold to achieve a target ARL without 
parameter tuning. 



6 Numerical Examples 

In this section, we present several numerical examples, first based on simulated data, and then real data, to demon- 
strate the performance of MOUSSE in tracking a submanifold and detecting changcpoints. We also verify that our 
theoretical approximation to ARL in Section 5.5 is quite accurate. 



6.1 Tracking a static submanifold 

We first study the performance of MOUSSE tracking a static submanifold. The dimension of the submanifold is 
D = 100 and the intrinsic dimension is d = 1. Fixing 9 € [—2, 2], we define v(9) £ R D with its n-th element 

[v(0)] n = l/V2^e-^~ 9)2 /^\ (39) 

where 7 = 0.6, and z n = —2 + An/ D, n = 1, . . . , 100, corresponds to regularly spaced points between —2 and 2. 
This static submanifold is sampled by sampling different £ [—2,2] and generating corresponding points on the 
submanifold according to (39). The observation xt is obtained from (1), where the noise variance is a 2 = 4 x 10~ 4 . 
We set parameter values as a = 0.95, e = 0.1, n = 0.03, and use PETRELS-FO. Figure 4 demonstrates that MOUSSE 
is able to track a static submanifold and reach the steady state quickly from a coarse initialization. In Figure 4 and 
the following numerical examples, the expected instantaneous fitting error is evaluated using N = 1200 draws from 
M. , denoted y\ , . . . , dn and computing the Monte Carlo estimate 

1 N 

i=l 

where Si denotes the minimum distance subset to yi. 
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Figure 4: MOUSSE tracking a static submanifold with D = 100 and d = 1. 



6.2 Tracking a slowly time varying submanifold 

Next we track a slowly time varying submanifold using MOUSSE, in a similar setting to Section 6.1 where D = 100 
and d = 1. The submanifold is also generated using (39), except that now we let 7 to be time varying: 

7 f 0.6- 70 i t=l,2,...,s 

11 \ 0.6-7o(2s-t) t = s + l,s + 2,...,2s { ' 

where parameter 70 controls how fast the submanifold changes. We choose 70 = 2 x 10~ 4 , s = 1000 with 20% missing 
data, and the parameters for MOUSSE are /1 = 0.03, e = 0.1, and a = 0.9. The result of the tracking can be found in 
an illustrative video in http://nislab.ee.duke.edu/MOUSSE/indcx.hml. Snapshots of this video at time t = 250 and 
t = 1150 are shown in Figure 1. In this display, the dashed line corresponds to the true submanifold, the red lines 
correspond to the estimated union of subspaces, and the + signs correspond to the past 500 samples, with darker 
colors corresponding to more recent observations. From this video, it is clear that we are effectively tracking the 
dynamics of the submanifold, and keeping the representation parsimonious so the number of subspaces used by our 
model is proportional to the curvature of the submanifold, and as the curvature increases and decreases, the number 
of subspaces used in our approximation similarly increases and decreases. The number of subspaces K t and fitting 
error as a function of time are shown in Figure 5. The red line in Figure 5 corresponds to e. Note that MOUSSE is 
able to track the submanifold, in that it can maintain a stable number of leaf nodes in the approximation and meet 
the target error tolerance e. 

6.3 Comparison of tracking algorithms 

We also compare the performance of different tracking algorithms presented in Section 3.4: GROUSE, PETRELS- 
GS and PETRELS-FO. We use E{et} defined in (40) as a comparison metric. In comparing the three methods, 
we set the parameters for each tracking algorithm such that the algorithm has the best possible performance. The 
comparison is displayed in Figure 6, where the horizontal axis is the submanifold changing rate 7, the vertical axis is 
the percentage of missing data, and the brightness of each block corresponds to E{et}. In Figure 6, PETRELS-FO 
performs better than GROUSE and PETRELS-GS, especially with a large percentage of missing data. Also note 
that for PETRELS-FO, the optimal parameters are fairly stable for various combinations of submanifold changing 
rate and percentage of missing data: the optimal parameters are a « 0.9, /j, « 0.2, and e w 0.1. Considering its 
lower computational cost and ease of parameter tuning, we use PETRELS-FO with MOUSSE for the remaining 
experiments in this paper. 

6.4 Changepoint detection example 

To verify our theoretical approximation for ARL, we perform Monte Carlo simulation. Direct simulation of T to 
obtain E°°{T} is very time-consuming because we typically want to choose a b such that E°°{T} is on the order 
of 10000. Hence we use an indirect simulation method commonly used in changepoint detection [48]. The indirect 
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Figure 5: MOUSSE tracking a slowly evolving submanifold with D = 100 and d = 1. Dashed red line depicts target 
error tolerance e. 
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(a) E{e t } of MOUSSE using GROUSE 



§ 1 o% 

05 

'E 



§50%\ 



70% 



















mm 


rngm 


H 


■ ■ 


H 



8 10 



1 

0.8 
0.6 
0.4 
0.2 




Changing rate (x 10 



(b) E{e t } of MOUSSE using PETRELS- 
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Figure 6: MOUSSE tracking a slowly varying submanifold using: (a) GROUSE, (b) PETRELS-GS and (c) 
PETRELS-FO. Horizontal axis corresponds to rate of submanifold's change and vertical axis corresponds to fraction 
of data missing. Brightness corresponds to E{e t }. 



method is based on the fact that when there is no changepoint, for large b, T is typically exponentially distributed. 
Hence, we have under Ho, 

P{T > m} = e- m / E °°< T \ 
which means we can simulate P{T > m} for fixed m and 6, and then obtain under H 

E°°{T} = -m/ logP{T > m}. (42) 

Using this formula, we generate 10000 Monte Carlo (MC) trials, each a slowly time-varying submanifold of duration 
t = 500. Then we use MOUSSE to track the submanifold, and obtain a sequence of errors. We form the CUSUM 
statistics using this sequence of errors, and find the fraction of sequences such that 

1 \(S t - S k ) - n (t - k)\ 
max max , > o, 

l<t<500 t-w<k<t cr y/t — k 

which is the estimate for P{T > 500}. Then we use the above formula to obtain E°°{T}. 

TABLE 1 shows the value of b suggested by theory for different ARLs and the value of b computed using the MC 
procedure described above. For comparison, wc also obtain the ARL by treating this as a single subspace tracking 
problem for which PETRELS-FO is employed. These values of b are in parentheses. 

To estimate the expected detection delay, we generate instances where the "ft in the model has an abrupt jump 
A 7 at time t = 200. 

f0.6- 7o t 4 = 1,2,..., 199 ( ) 

ft \ 0.6-A 7 - 7o t t = 200, 201,... ,400 1 ' 
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Tabic 1: Average run length (ARL) E°°{T}. In all cases the theoretical value is close to the MC estimate. 



ARL 


b 


b MC with 0% data 


b MC with 20% data 


b MC with 40% data 




missing 


missing 


missing 






MOUSSE 


Single Subspace 


MOUSSE 


Single Subspace 


MOUSSE 


Single Subspace 


1000 


3.94 


4.53 


3.94 


4.38 


3.99 


4.17 


4.03 


5000 


4.35 


5.87 


4.64 


5.38 


4.73 


5.02 


4.75 


10000 


4.52 


6.78 


4.94 


5.87 


5.02 


5.40 


5.03 



Then wc apply the CUSUM statistics and find the expected detection delay E°{T} with respect to t = 200 using 
10000 Monte Carlo trials. We compare the expected detection delay of MOUSSE and the single subspace tracking 
method. Results corresponding to big (A 7 = 0.05) and small (A 7 = 0.03) magnitude of jump of j t are given in 
TABLE 2, 3 respectively. Again, values in parenthesis are obtained by using single subspace tracking. The threshold 
6's are chosen according to the Monte Carlo results given in TABLE 1; e.g., for the cell corresponding to ARL = 1000 
and 0% missing data in TABLE 2 or 3, & should be set as 4.55 for MOUSSE and 4.28 for the single subspace method. 
TABLE 2, 3 demonstrate that MOUSSE has much smaller expected detection delay than a single subspace method. 

Tabic 2: Detection delay when jump of jt is A 7 = 0.05. Delays are significantly shorter with MOUSSE than with 
single subspace tracking. 



ARL 


delay with 0% data 


delay with 20% data 


delay with 40% data 


missing 


missing 


missing 




MOUSSE 


Single Subspace 


MOUSSE 


Single Subspace 


MOUSSE 


Single Subspace 


1000 


1.92 


60.26 


1.85 


56.99 


1.88 


53.71 


5000 


2.32 


91.66 


2.13 


86.82 


2.16 


80.48 


10000 


2.68 


104.91 


2.31 


98.57 


3.31 


91.21 



Table 3: Detection delay when jump of jt is A 7 = 0.03. Delays are significantly shorter with MOUSSE than with 
single subspace tracking. 



ARL 


delay with 0% data 


delay with 20% data 


delay with 40% data 


missing 


missing 


missing 




MOUSSE 


Single Subspace 


MOUSSE 


Single Subspace 


MOUSSE 


Single Subspace 


1000 


5.88 


110.40 


3.80 


105.12 


3.64 


99.16 


5000 


6.68 


163.80 


5.46 


157.52 


5.04 


147.66 


10000 


9.10 


183.94 


6.53 


176.05 


6.82 


165.12 



6.5 Real data 

A video from the Solar Data Observatory, which demonstrates an abrupt emergence of a solar flare, can be found 
on http://nislab.ee.duke.edu/MOUSSE/index.html. The Solar Object Locator for the original data is SOL2011-04- 
30T21-45-49L061C108. Also displayed is the residual e t of (26) obtained using MOUSSE, which clearly shows peaks 
in the vicinity of the solar flare. A frame from this dataset during a solar flare is shown in Figure 7a. The frame is of 
size 232 x 292 resulting in 67744 dimensional online data. To reduce difficulty of parameter tuning, we scale the pixel 
intensities in the dataset by multiplying the data by a factor of 10 -4 to be consistent with the scale of our simulated 
data experiments. The parameters we use are e = 0.3, /x = 0.03, and a = 0.85. The video and the snapshots in 
Figure 7 demonstrate that MOUSSE can not only detect the emergence of a solar flares but also localize the flare 
by presenting it, and these tasks are accomplished far more effectively with MOUSSE (even with d = 1) than with 
a single subspace. Note that with the single subspace tracking, the residual norm e(t) is not a stationary time series 
prior to the flare and thus poorly suited for changepoint detection. In the original images, the background solar 
images has bright spots that are slowly and changing shape, which makes detection based on simple background 
subtraction incapable of detecting small transient flares. In contrast, with our approach, with K t around 10, the 
underlying manifold structure is better tracked and thus yields more obvious error e(t) when anomaly occurs. 
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(a) Snapshot of original SDO data at t 
227 



(b) MOUSSE residual at i = 227 (c) 



Single subspace tracking residual at 
227 




100 150 200 250 300 



(d) e t by MOUSSE 






(e) CUSUM stats by MOUSSE (f) e t by single subspace tracking (g) CUSUM stats by single sub- 
space tracking 



Figure 7: Detection of solar flare at t = 227: (a) snapshot of original SDO data at t = 227; (b) MOUSSE residual 
it, which clearly identifies an outburst of solar flare; (c) single subspace tracking residual it, which gives a poor 
indication of the flare; (d) e(t) for MOUSSE which peaks near the flare around t = 227; (e) the CUSUM statistic 
(solid blue line) for MOUSSE with dashed red line indicating threshold b for ARL=10000; (f) e{t) for single subspace 
tracking; (g) the CUSUM statistic for single subspace tracking. Using a single subspace gives much less reliable 
estimates of significant changes in the statistics of the frames. 



Our second real data example is related to automatic identity theft detection. The basic idea is that consumers 
have typical spending patterns which change abruptly after identity theft. Banks would like to identify these changes 
as quickly as possible without triggering numerous false alarms. To test MOUSSE on this high-dimensional change- 
point detection problem, we examined the E-commerce transaction history of people in a datasct used for a 2008 
UCSD data mining competition at http : //www . cs . purdue . edu/commugrate/data_access/all_data_sets_more . php?search. 
For each person in this dataset, there is a time series of transactions. For each transaction we have a 31-dimcnsional 
real- valued feature vector and a label of whether the transaction is "good" (0) or "bad" (1). The full dataset was 
generated for a generic anomaly detection problem, so it generally is not appropriate for our setting. However, some 
of these transaction timeseries show a clear changepoint in the labels, and we applied MOUSSE to these timeseries. 
In particular, we use MOUSSE to track the 31-dimensional feature vector and detect a changepoint, and compare this 
with the "ground truth" changepoint in the label timeseries. In calculating the CUSUM statistic, we estimate the 
fio and Co of equation 25 from e\, . . . , e2o- After t = 20, every time the CUSUM statistic exceeds the threshold b and 
an changepoint is detected, we "reset" the CUSUM to only consider et after the most recently detected changepoint. 
This allows us to detect multiple changepoints in a timeseries. 

The effect of our procedure for one person's transaction history is displayed in Figure 8. We first see that MOUSSE 
accurately detects a temporally isolated outlier transaction at t = 38, after which the CUSUM is reset. After this, 
while MOUSSE does not generate particularly large spikes in et, the associated CUSUM statistic shows a marked 
increase near t = 70 and hits the threshold at t = 72, when the labels (not used by MOUSSE) change from (good) 
to 1 (bad). After this the CUSUM is repeatedly reset and repeatedly detects the change in the statistics of et from 
the initial stationary process. 



7 Conclusions 

This paper describes a novel multiscale method for online tracking of high-dimensional data on a low-dimensional 
submanifold, and using the tracking residuals to perform fast and robust changepoint detection. Changepoint 
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(b) Visualization of time-varying Attributes 



Figure 8: Credit card user data experiments, (a) From top to bottom: number of leaf nodes used by MOUSSE; 
e t ; CUSUM statistic (solid blue line) and theoretical threshold b corresponding to ARL = 10000 (dashed red line); 
ground truth label. Note that the CUSUM statistic has a false alarm due to an outlier at t = 38, and it starts 
increasing at t = 70 and frequently hits the threshold afterwards due to the changepoint at t = 70. In this case 
CUSUM catches both the outlier and the changepoint. (b) Demonstration of the time- varying Xt (user attributes): 
each column corresponds to the 31- dimensional attribute vector at a given time. The white spots corespond to the 
outlier at time t = 38. 



detection is an important subset of anomaly detection problems due to the ever-increasing volume of streaming data 
which must be efficiently prioritized and analyzed. The multiscale structure at the heart of our method is based on a 
Geometric MultiRcsolution Analysis which facilitates low-complexity piecewise-linear approximations to a manifold. 
The multiscale structure allows for fast updates of the manifold estimate and flexible approximations which can 
adapt to the changing curvature of a dynamic submanifold. These ideas have the potential to play an important role 
in analyzing large volumes of streaming data, which arises in remote sensing, credit monitoring, and network traffic 
analysis. 

While the algorithm proposed in this paper has been focused on unions of subspaces, an important open question 
is whether similar techniques could be efficiently adopted based on sparse covariance matrix selection [51,52]. The 
resulting approximation space may no longer correspond to a low-dimensional submanifold, but such structures 
provide good representations of high-dimensional data in many settings, and our future work includes tracking the 
evolution of a mixture of such structures. Issues related to non-Gaussian observation models, inverse problem settings, 
dynamical models, and optimal selection of the statistic used for changepoint detection (i.e., alternatives to et, as 
considered in [53]) all pose additional interesting open problems. 



A Proof of Theorem 1 

Proof. From (1) and (9) we have 

P = U T (v~c) + U T w, (44) 

Note that U T w is zero-mean Gaussian random vector with covariance matrix o~ 2 U T U = a 2 !. 

Next we consider the missing data case. Recall Vq, £ ~M) n \ xD is a projection matrix. Define wq = Vnw. From 
(9) we have 

Pn =U*(vn-c n ) + U*wn (45) 
Suppose in (1) we write v — c = p + q, with p G S and q e S , where S denotes the orthogonal subspace of S. 
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Hence, p = UU T (v - c) and q = (I - UU T )(v - c). Let p n = VnP, qa = Vaq. Hence, vq - cq = pa + qq. Note that 



So 

Hence 



={U^U n )- l U^UnU T {v-c) 
=U T (v-c). 

Pn = U T (v-c) + U*q n + U*w n . 



(46) 
(47) 
(48) 



ll/3o-/3|| 2 

<2\\U*q n \\ 2 +2\\U*w Q -U 
= 2\\(U^Un)- 1 U^qn\\ 2 + 2||[(t£ U^U^Va - U T ]w\\ 



T„„||2 



We will bound these two terms separately. 
First, note that 



(49) 



where |j A\\2 denotes the spectral norm of matrix A. Using [Lemma 2] in [49], we have that with probability 1 — e, if 
|0| > |dcoh(C/)log(2d/£), 

||^n|| 2 <(l + ^ 2 J§^coh(C/)[| g || 2 , 



where 6 



at least 1—e, 



' maX "|jq||^"^ l°s(V e )- Using [Lemma 3] in [49] we have that provided that < £ < 1, with probability 



\\(u£Un)-%< 



D 



(i-£)\n\ 

Combine these with (49), we have that with probability 1 — 2e, 

IK^r^gnll 2 < ■ 4r ■ coh(C/)ikH 2 . 



{I -if |fi| 



(50) 



(51) 



Next we examine the noise term. Define 

w=[(U^ l Un)- 1 U^ l Vn-U T }w, 
which is a zero-mean Gaussian random vector with covariance matrix 

T = a 2 (U^U Q y 1 -a 2 I, 

where we have used the fact that VqVq = I. Hence we bound the tail of the noise power using Markov inequality 



„7.l|2 



> 2tV) < e-^e^ll 2 /^ 2 )} < 2De- T 



(52) 



provided that r is sufficiently large such that the maximum eigenvalue is smaller than r: A max ((J7Q Uq)^ 1 ) < r, 
i.e., t > D/[(l—£)\Cl\], by noting that A max ((E^n ^o) _1 ) = || {Uq ) 1 1 1 2 - The last equality in (52) is because, under 
such condition: 

,\\w\\ 2 /(2ra 2 ) 1 



e ||x||»/(2TO(2 7 r)- D / 2 |r|- 1 /2 e -J* r " x dx 



= (2 7 r)- I? / 2 |r|- 1 / 2 / e-^V-^^^dx 

= iri-^ir- 1 - r^a- 2 ^- 1 ' 2 = \i- r-V^ri- 1 / 2 

= |(l + l/r)/-r- 1 (^n)- 1 |- 1/2 
< £>[(! + 1/t-) - T- 1 ||(I7 c I'l7n)- 1 || 3 ]- 1/2 
D 



(53) 



< D 



= D 



(1 + 1/t)- 



r(i-£)\n\ 



-1/2 



D 



t\1-£)\Q\ 



-1) 



-1/2 
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In the last inequality, we have used (50). Note that D/[(l — > 1, and the upper bound in (53) is smaller 

than 2D if r > |((xz^jnj — or T > I (i-?) | n | • Now we set 2De~ T = e, if e is sufficiently small such that 
log(2D/e) > | (!_f)|n | • Hence we have when [Qj > | lo ^ (2£ ,y e) , |jw|| 2 < ^ (i^)*|n| a with probability 1 - s. 
Finally, combining (51) and (53), we obtain the statement in Theorem 1. 

□ 
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